Beijing, China - May 24-26, 2016

Outline

  • Session 1: Motivation, why and how to think about data, and getting started with R
  • Session 2: Making basic plots, grammar of graphics, good practices
  • Session 3: Advanced graphics, layering, using maps

What is Exploratory Data Analysis?

  • EDA is concerned with letting the data speak, and discovering what is in the data, as opposed to predicting from the data
  • Initial data analysis is a part of EDA, where data quality and model assumptions are checked using descriptive statistics, prior to modeling: "The first thing to do with data is to look at them…. usually means tabulating and plotting the data in many different ways to 'see what's going on'. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later." Crowder and Hand, 1990
  • Relax the focus on the problem statement and explore broadly different aspects of the data.

Tukey's contributions

  • EDA complements model building: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data" Tukey, 1986.
  • "The greatest value of a picture is when it forces us to notice what we never expected to see." Tukey, 1977. Plotting data is an important component of EDA.

Examples

These are two examples of data sets that I've analysed in recent years, where I learned a lot by making plots.

  • Education: Every four years students across the globe are tested on their math, reading and science skills and surveyed about their educational experience and social environment, as part of assessing workforce readiness of teenagers. http://www.oecd.org/pisa/pisaproducts/
  • Climate: Monitors and sensors are located across the globe measuring aspects of the environment, e.g. Scripps Inst. of Oceanography

The data can be pulled from the web, and the code that produced the plots in these slides is in the .Rmd version, so that you can reproduce this work yourself.

Education

  • Math, science and reading test scores for 485,490 students
  • 65 countries, with between 100 and 1500 schools in each
  • Student questionnaires about their environment (635 vars)
  • Parents surveyed on work, life, income (143 vars)
  • Principals provide information about their schools (291 vars)

Math Gender Gap

  • Primary question: Are boys better than girls at math, ON AVERAGE?
  • Secondary questions: How do studying, truancy, parents, possessions, number of TVs in the household, … affect the scores?

Calculations

  • Compute the weighted mean for each of girls and boys, for each country
  • Difference the means
  • Take bootstrap samples, and recompute to produce confidence intervals
  • Plot the country mean difference in order of largest to smallest
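
The four steps above can be sketched in base R. This is only a minimal sketch on simulated data, not the code behind the slides: the data frame `scores`, its columns (`country`, `gender`, `math`, `weight`) and the helper `gap()` are hypothetical stand-ins for the PISA variables, and the interval is a simple percentile bootstrap.

```r
# Simulated stand-in for the PISA student data (all names are hypothetical)
set.seed(2016)
n <- 600
scores <- data.frame(
  country = sample(c("A", "B", "C"), n, replace = TRUE),
  gender  = sample(c("female", "male"), n, replace = TRUE),
  math    = rnorm(n, mean = 500, sd = 100),
  weight  = runif(n, min = 0.5, max = 2)   # survey weights
)

# Weighted mean for boys minus weighted mean for girls, within one country
gap <- function(d) {
  boys  <- d[d$gender == "male", ]
  girls <- d[d$gender == "female", ]
  weighted.mean(boys$math, boys$weight) - weighted.mean(girls$math, girls$weight)
}

# Steps 1-2: the observed gap for each country
by_country <- split(scores, scores$country)
observed <- sapply(by_country, gap)

# Step 3: resample students within each country and recompute the gap
boot_ci <- sapply(by_country, function(d) {
  reps <- replicate(200, gap(d[sample(nrow(d), replace = TRUE), ]))
  quantile(reps, c(0.025, 0.975))
})

# Step 4: collect the results, ordered from largest to smallest gap
result <- data.frame(
  country = names(observed),
  gap     = unname(observed),
  lower   = boot_ci[1, ],
  upper   = boot_ci[2, ]
)
result <- result[order(result$gap, decreasing = TRUE), ]
result
```

Because the simulated scores have no real gender effect, most intervals should straddle zero; with real data, the ordered `result` is what would be plotted.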

Individual Scores

  • How do individuals look?
  • We will find the minimum and maximum values for girls and boys for each country, and plot these.

Reading Scores

  • Do girls do better than boys ON AVERAGE on reading tests?
  • Repeat the same analysis with reading scores instead of math.

Time Reported Studying Out of School

  • Compute the math mean for each hour of study
  • Plot mean by hour by country, join by a line, to examine trend
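
A minimal sketch of these two steps, assuming a data frame with one row per student. The data here are simulated, and the names `study`, `hours` and `math` are hypothetical, not the PISA variable names.

```r
# Simulated stand-in: hours of out-of-school study and math score
set.seed(1)
study <- data.frame(
  country = rep(c("A", "B"), each = 200),
  hours   = sample(0:10, 400, replace = TRUE)
)
study$math <- 450 + 5 * study$hours + rnorm(400, sd = 50)

# Mean math score for each hour of study, within each country
means <- aggregate(math ~ country + hours, data = study, FUN = mean)

# Empty plot frame, then one line per country to show the trend
plot(math ~ hours, data = means, type = "n",
     xlab = "Hours of study", ylab = "Mean math score")
for (cn in unique(means$country)) {
  d <- means[means$country == cn, ]
  d <- d[order(d$hours), ]
  lines(d$hours, d$math, type = "b")
}
```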

What's the Deal About Carbon Dioxide?

  • "Scientific consensus states that carbon emissions must be reduced by 80% by 2050 to avoid temperature rise of more than 2°C." Carbon Neutral
  • Carbon offsets: Carbon offsetting is the use of carbon credits to enable businesses to compensate for their emissions.
  • Kyoto Protocol, adopted in 1997: an attempt to get international cooperation to reduce emissions.

Carbon Dioxide Data

  • Data is collected at a number of locations world wide.
  • See Scripps Inst. of Oceanography
  • Let's pull the data from the web and take a look …
  • Recordings from South Pole (SPO), Kermadec Islands (KER), Mauna Loa Hawaii (MLF), La Jolla Pier, California (LJO), Point Barrow, Alaska (PTB).

What Do We Learn?

  • CO₂ is increasing, and the increase looks exponential. I really expected that the concentration would have flattened out with all of the efforts to reduce carbon emissions.
  • The same trend is seen at every location - REALLY? Need some physics to understand this.
  • Some stations show a seasonal pattern - in fact, the further north the station, the stronger the seasonality - WHY?

These Slides

  • This is a "live" document
  • Code and explanations together
  • Run the software to make the calculations on the data, and produce a nice presentation, or a Word, pdf or html document

(Slides and material for this workshop can be found at http://dicook.github.io/China-R.)

Big thanks to Xie Yihui, 谢益辉 for these tools!

Why R?

"R has become the most popular language for data science and an essential tool for Finance and analytics-driven companies such as Google, Facebook, and LinkedIn." Microsoft 2015

R is …

  • Free to use
  • Extensible: Over 7300 user contributed add-on packages currently on CRAN! More than 10000 on github.com
  • Powerful: With the right tools, get more work done, faster.
  • Flexible: Not a question of can, but how.
  • Frustrating: Flexibility comes at a cost

R does …

  • Graphics, statistics, machine learning, etc.
  • Data acquisition, munging, management
  • Literate programming (dynamic reports)
  • Web applications

RStudio is …

From Julie Lowndes:

If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer.

The RStudio IDE

  • Source editor: (1) Docking station for multiple files, (2) Useful shortcuts ("Knit"), (3) Highlighting/Tab-completion, (4) Code-checking (R, HTML, JS), (5) Debugging features
  • Console window: (1) Highlighting/Tab-completion, (2) Search recent commands
  • Other tabs/panes: (1) Graphics, (2) R documentation, (3) Environment pane, (4) File system navigation/access, (5) Tools for package development, git, etc

Data Analysis Cycle

Get Started

  • Want to work along with me?
  • Create a project for this workshop, start a new .Rmd log book to contain your work
  • Tackle the YOUR TURNs alone or with a partner

Create a Project

Create a project to contain all of the material covered in this set of tutorials:

  • File -> New Project -> New Directory -> Empty Project

Hello R Markdown!

  • File -> New File -> R Markdown -> OK -> Knit HTML

What is R Markdown?

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document. R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).

Getting Data

Data can be found in R packages

data(economics, package = "ggplot2")
# data frames are essentially a list of vectors
str(economics)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    574 obs. of  6 variables:
#>  $ date    : Date, format: "1967-07-01" "1967-08-01" ...
#>  $ pce     : num  507 510 516 513 518 ...
#>  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
#>  $ psavert : num  12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
#>  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
#>  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...

These are not usually kept up to date but are good for practicing your analysis skills on.

Getting Data

Or in their own packages

library(gapminder)
str(gapminder)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
#>  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
#>  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
#>  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
#>  $ gdpPercap: num  779 821 853 836 740 ...

More contemporary sets here, but not updated frequently.

Getting Data

I primarily use the readr package for reading data now. It mimics the base R reading functions but is implemented in C++, so it reads large files quickly, and it also attempts to identify the types of variables.

library(readr)  # provides read_csv
library(knitr)  # provides kable
ped <- read_csv("../data/Pedestrian_Counts.csv")
kable(head(ped))
| Date_Time         | Sensor_ID | Sensor_Name                | Hourly_Counts |
|-------------------|-----------|----------------------------|---------------|
| 01-MAY-2009 00:00 | 4         | Town Hall (West)           | 209           |
| 01-MAY-2009 00:00 | 17        | Collins Place (South)      | 28            |
| 01-MAY-2009 00:00 | 18        | Collins Place (North)      | 36            |
| 01-MAY-2009 00:00 | 16        | Australia on Collins       | 22            |
| 01-MAY-2009 00:00 | 2         | Bourke Street Mall (South) | 52            |
| 01-MAY-2009 00:00 | 1         | Bourke Street Mall (North) | 53            |

You can pull data together yourself, or use data compiled by someone else.

Your Turn

  • Look at the documentation for the economics data in the ggplot2 package. Can you think of questions you could answer using these variables?

  • Write these into your .Rmd file.

Your Turn

  • Read the documentation for gapminder data. Can you think of questions you could answer using these variables?

  • Write these into your .Rmd file.

Your Turn

  • Read the documentation for pedestrian sensor data. Can you think of questions you could answer using these variables?

  • Write these into your .Rmd file.

Some Basics

  • Assigning a value to a name is done with <-, which is read as "gets"
  • The n_max=50 option to the read_csv function reads just the first 50 rows
  • dim reports the dimensions (rows, columns) of the data
  • colnames shows the column names (you can see these by looking at the object in the RStudio environment window, too)
  • $ specifies the column to use
  • typeof indicates how R stores the values in a column, i.e. what type R thinks they are
  • complex variable names containing spaces, etc., can be used, as long as they are wrapped in backticks
    workers$`Claim Type`
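
The functions above can be tried on a small data frame. This is a minimal sketch: the contents of `workers` are made up here for illustration, since the original data set is not shown on the slide.

```r
# A small made-up data frame; check.names = FALSE keeps the space in "Claim Type"
workers <- data.frame(
  id = 1:3,
  `Claim Type` = c("injury", "illness", "injury"),
  check.names = FALSE
)

dim(workers)            # number of rows, then number of columns
colnames(workers)       # column names
typeof(workers$id)      # how R stores the id column
workers$`Claim Type`    # backticks allow a name containing a space
```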

Data Types

  • lists are heterogeneous (elements can have different types)
  • data.frames are heterogeneous, but all elements (columns) have the same length
  • vectors and matrices are homogeneous (elements have the same type), which is why c(1, "2") ends up being a character vector.
  • functions can be written to avoid repeating code again and again

  • If you'd like to know more, see Hadley Wickham's online chapters on data structures and subsetting
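
A short sketch of the distinctions above. The object names and the helper `range_width()` are made up for illustration.

```r
# A vector is homogeneous: mixing a number and a string coerces both to character
v <- c(1, "2")
typeof(v)

# A list is heterogeneous: each element keeps its own type
l <- list(1, "2")
sapply(l, typeof)

# A data frame is a list of equal-length columns, which may differ in type
df <- data.frame(n = c(1, 2), s = c("a", "b"), stringsAsFactors = FALSE)
sapply(df, typeof)

# A function saves repeating the same code
range_width <- function(x) max(x) - min(x)
range_width(c(3, 8, 5))
```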

Operations

  • Use built-in vectorized functions to avoid loops
set.seed(1000)
x <- rnorm(6)
x
#> [1] -0.446 -1.206  0.041  0.639 -0.787 -0.385
sum(x + 10)
#> [1] 58
  • R has rich support for documentation, see ?sum

  • Use [ to extract elements of a vector.
x[1]
#> [1] -0.45
x[c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)]
#> [1] -0.446  0.041  0.639

  • Extract named elements with $, [[, and/or [
x <- list(
  a = 10,
  b = c(1, "2")
)
x$a
#> [1] 10
x[["a"]]
#> [1] 10
x["a"]
#> $a
#> [1] 10
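
To make the point about vectorized functions concrete, here is a minimal sketch comparing an explicit loop with the vectorized call, followed by extraction with a logical condition. It reuses the same seed as above and is an illustration only.

```r
set.seed(1000)
x <- rnorm(6)

# Loop version: add 10 to each element, accumulating the total
total <- 0
for (xi in x) {
  total <- total + (xi + 10)
}

# Vectorized version: the same result in one expression
all.equal(total, sum(x + 10))

# A comparison returns a logical vector, which can be used inside [
x > 0
x[x > 0]
```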

Examining 'structure'

  • str() is a very useful R function. It shows you the "structure" of (almost) any R object (and everything in R is an object!!!)
str(x)
#> List of 2
#>  $ a: num 10
#>  $ b: chr [1:2] "1" "2"

Missing Values

  • NA is the indicator of a missing value in R
  • Most functions have options for handling missing values
x <- c(50, 12, NA, 20)
mean(x)
#> [1] NA
mean(x, na.rm=TRUE)
#> [1] 27
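
A few related base R tools, sketched on the same vector: `is.na` finds the missing values, and other summary functions accept the same `na.rm` argument as `mean`.

```r
x <- c(50, 12, NA, 20)

is.na(x)          # TRUE where a value is missing
sum(is.na(x))     # number of missing values
x[!is.na(x)]      # keep only the observed values

# Other summaries follow the same na.rm convention as mean
sd(x, na.rm = TRUE)
max(x, na.rm = TRUE)
```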

Counting Categories

  • The table function can be used to count how many times each category occurs
table(ped$Sensor_Name)
#> 
#>                           Alfred Place 
#>                                  12365 
#>                   Australia on Collins 
#>                                  48310 
#>                         Birrarung Marr 
#>                                  44904 
#>             Bourke St-Russel St (West) 
#>                                  18573 
#>            Bourke St-Russell St (West) 
#>                                   2208 
#>             Bourke Street Mall (North) 
#>                                  47254 
#>             Bourke Street Mall (South) 
#>                                  59205 
#>         Chinatown-Lt Bourke St (South) 
#>                                  20911 
#>          Chinatown-Swanston St (North) 
#>                                  19797 
#>                            City Square 
#>                                  17860 
#>                  Collins Place (North) 
#>                                  59205 
#>                  Collins Place (South) 
#>                                  54886 
#>                      Flagstaff Station 
#>                                  59205 
#>        Flinders St-Elizabeth St (East) 
#>                                  19917 
#>                   Flinders St-Spark La 
#>                                  14470 
#>           Flinders St-Spring St (West) 
#>                                  15574 
#>         Flinders St-Swanston St (West) 
#>                                   9503 
#>      Flinders Street Station Underpass 
#>                                  59205 
#>          Grattan St-Swanston St (West) 
#>                                   6191 
#>                    Lonsdale St (South) 
#>                                   2208 
#>           Lonsdale St-Spring St (West) 
#>                                   9047 
#>                Lonsdale Street (South) 
#>                                  16630 
#>                        Lygon St (East) 
#>                                   8759 
#>                        Lygon St (West) 
#>                                   2208 
#>                    Lygon Street (West) 
#>                                  18070 
#>                      Melbourne Central 
#>                                  58773 
#> Melbourne Convention Exhibition Centre 
#>                                  20781 
#>           Monash Rd-Swanston St (West) 
#>                                   7006 
#>                               New Quay 
#>                                  56278 
#>                         Princes Bridge 
#>                                  59205 
#>                        Queen St (West) 
#>                                   8144 
#>          QV Market-Elizabeth St (West) 
#>                                  20542 
#>                      QV Market-Peel St 
#>                                  21189 
#>                       Sandridge Bridge 
#>                                  59157 
#>                 Southern Cross Station 
#>                                  59205 
#>          Spencer St-Collins St (North) 
#>                                  20039 
#>          Spencer St-Collins St (South) 
#>                                  20637 
#>             St Kilda-Alexandra Gardens 
#>                                  18886 
#>                          State Library 
#>                                  53616 
#>                        The Arts Centre 
#>                                  20311 
#>           Tin Alley-Swanston St (West) 
#>                                   7004 
#>                       Town Hall (West) 
#>                                  58869 
#>                         Victoria Point 
#>                                  58773 
#>                        Waterfront City 
#>                                  59205 
#>                            Webb Bridge 
#>                                  58533
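
A long table like this is easier to scan when sorted. A minimal sketch, using a short made-up vector `sensor` in place of `ped$Sensor_Name`:

```r
# A made-up vector of sensor names standing in for ped$Sensor_Name
sensor <- c("Town Hall", "New Quay", "Town Hall",
            "Webb Bridge", "Town Hall", "New Quay")

counts <- table(sensor)

# Sort so the most frequent categories come first
sort(counts, decreasing = TRUE)

# Convert to a data frame when the counts are needed for further work
as.data.frame(counts)
```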

Some Oddities

  • Yes, + is a function (which calls compiled C code)
`+`
#> function (e1, e2)  .Primitive("+")
  • What's that? You don't like addition? Me neither!
"+" <- function(x, y) "I forgot how to add"
1 + 2
#> [1] "I forgot how to add"
  • But seriously, don't "overload operators" unless you know what you're doing
rm("+")
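
Since operators are ordinary functions, they can be called by name or passed to other functions. A minimal sketch; the operator `%paste%` is a made-up name, shown as a safer alternative to redefining a built-in operator.

```r
# Backticks let you call an operator like any other function
`+`(1, 2)

# Operators can be passed to functions that expect a function
Reduce(`+`, 1:4)
sapply(1:3, `*`, 2)

# Define a new operator instead of overwriting an existing one
`%paste%` <- function(a, b) paste(a, b)
"Hello" %paste% "R"
```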

Getting Help on the Web

  • Reading documentation only gets you so far. What about finding function(s) and/or package(s) to help solve a problem???

  • Google! (I usually prefix "CRAN" to my search; others might suggest http://www.rseek.org/)

  • Ask your question on a relevant StackExchange outlet such as http://stackoverflow.com/ or http://stats.stackexchange.com/

  • It's becoming more and more popular to bundle "vignettes" with a package (dplyr has awesome vignettes)

browseVignettes("dplyr")

Your Turn

  1. Read in the OECD PISA data
  2. Tabulate the countries (CNT)
  3. Extract the values for Australia (AUS) and Shanghai (QCN)
  4. Compute the average and standard deviation of the reading scores (PV1READ), for each country
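
If you get stuck, steps 2 to 4 might look something like the sketch below. It runs on a toy data frame with simulated values, not the real PISA file, and it skips step 1 because the file location depends on your setup. Only the column names `CNT` and `PV1READ` come from the exercise; `pisa` and `sub` are hypothetical names.

```r
# Toy stand-in for the PISA student file; the values are simulated, not real scores
set.seed(42)
pisa <- data.frame(
  CNT = sample(c("AUS", "QCN", "USA"), 300, replace = TRUE),
  PV1READ = rnorm(300, mean = 500, sd = 90),
  stringsAsFactors = FALSE
)

# Step 2: tabulate the countries
table(pisa$CNT)

# Step 3: keep only Australia and Shanghai
sub <- pisa[pisa$CNT %in% c("AUS", "QCN"), ]

# Step 4: mean and standard deviation of reading scores for each country
aggregate(PV1READ ~ CNT, data = sub,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```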

Australian Election Data

This is a current project (joint with Ben Marwick, Rob Hyndman, Heike Hofmann, Carson Sievert, Nathaniel Tomasetti). Code and data are provided to study the electoral maps and system.

  • Spatial boundaries of electorates
  • Results of the 2013 elections
  • 2011 Census data aggregated to electorate level

There is a shiny app that facilitates interactive exploration of the data.

Next session

  • Wrangling your data into shape
  • Basic plotting of data

Credits